AITopics | image vector

Compositional Concept Generalization with Variational Quantum Circuits

Hawashin, Hala, Abbaszadeh, Mina, Joseph, Nicholas, Pearson, Beth, Lewis, Martha, sadrzadeh, Mehrnoosh

arXiv.org Artificial Intelligence, Sep-12-2025

Compositional generalization is a key facet of human cognition, but lacking in current AI tools such as vision-language models. Previous work examined whether a compositional tensor-based sentence semantics can overcome the challenge, but led to negative results. We conjecture that the increased training efficiency of quantum models will improve performance in these tasks. We interpret the representations of compositional tensor-based models in Hilbert spaces and train Variational Quantum Circuits to learn these representations on an image captioning task requiring compositional generalization. We used two image encoding techniques: a multi-hot encoding (MHE) on binary image vectors and an angle/amplitude encoding on image vectors taken from the vision-language model CLIP. We achieve good proof-of-concept results using noisy MHE encodings. Performance on CLIP image vectors was more mixed, but still outperformed classical compositional models.
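As a rough illustration of the amplitude encoding mentioned in the abstract, the sketch below zero-pads a real-valued image vector to a power-of-two length and rescales it to unit L2 norm, so that its entries could serve as the amplitudes of an n-qubit state. The function name `amplitude_encode` and the zero-padding choice are assumptions made for illustration; the abstract does not describe the authors' actual encoding pipeline.

```python
import math

def amplitude_encode(vector):
    """Return (amplitudes, n_qubits) for a real-valued vector.

    Assumption: zero-pad to the next power of two, then L2-normalize.
    This is a generic sketch, not the paper's implementation.
    """
    if not any(vector):
        raise ValueError("cannot encode an empty or all-zero vector")
    # Smallest n with 2**n >= len(vector); at least one qubit.
    n_qubits = max(1, (len(vector) - 1).bit_length())
    padded = list(vector) + [0.0] * (2 ** n_qubits - len(vector))
    norm = math.sqrt(sum(x * x for x in padded))
    return [x / norm for x in padded], n_qubits

amplitudes, n_qubits = amplitude_encode([3.0, 0.0, 4.0])
```

The squared amplitudes sum to one, which is the condition a valid quantum state vector has to satisfy.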

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.09541

Country: Europe > United Kingdom > England (0.46)

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

Xu, Zhiyang, Chen, Jiuhai, Lin, Zhaojiang, Pan, Xichen, Huang, Lifu, Zhou, Tianyi, Khabsa, Madian, Wang, Qifan, Jin, Di, Yasunaga, Michihiro, Yu, Lili, Lin, Xi Victoria, Nie, Shaoliang

arXiv.org Artificial Intelligence, Jul-15-2025

Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2506.10395

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report (0.41)

Technology:

Information Technology > Artificial Intelligence > Vision > Image Understanding (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems, Feb-7-2025, 12:21:02 GMT

Summary: This paper addresses the task of image-based Q&A on 2 axes: comparison of different models on 2 datasets and creation of a new dataset based on existing captions. Quality: The paper addresses an important and interesting new topic which has seen a recent surge of interest (Malinowski2014, Malinowski2015, Antol2015, Gao2015, etc.). The paper is technically sound, well-written, and well-organized. The authors achieve good results on both datasets, and the baselines are useful for understanding important ablations. The new dataset is also much larger than previous work, allowing training of stronger models, esp. However, there are several weaknesses: their main model is not very different from existing work on image-Q&A (Malinowski2015, who also had a VIS LSTM style model, but they were also jointly training the CNN and RNN, and also decoding with RNNs to produce longer answers) and achieves similar performance (except that adding bidirectionality and 2-way image input helps).

author feedback and meta-review, dataset, malinowski2015, (12 more...)

Neural Information Processing Systems

Genre: Summary/Review (0.71)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.58)

Unsupervised Deep Learning Image Verification Method

Solomon, Enoch, Woubie, Abraham, Emiru, Eyael Solomon

arXiv.org Artificial Intelligence, Feb-6-2024

Although deep learning is commonly employed for image recognition, it usually requires a huge amount of labeled training data, which may not always be readily available. This leads to a noticeable performance disparity when compared to state-of-the-art unsupervised face verification techniques. In this work, we propose a method to narrow this gap by leveraging an autoencoder to convert the face image vector into a novel representation. Notably, the autoencoder is trained to reconstruct neighboring face image vectors rather than the original input image vectors. These neighbor face image vectors are chosen through an unsupervised process based on the highest cosine scores with the training face image vectors. The proposed method achieves a relative improvement of 56% in terms of EER over the baseline system on the Labeled Faces in the Wild (LFW) dataset. This has successfully narrowed the performance gap between cosine and PLDA scoring systems.
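The neighbor selection step described in the abstract can be sketched roughly as follows: for each training vector, pick the other training vector with the highest cosine score and use it as the reconstruction target. The helper names `cosine` and `nearest_neighbor_targets` are hypothetical, and choosing exactly one neighbor while excluding the vector itself is an assumption; the abstract does not state how many neighbors the authors use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor_targets(vectors):
    """For each vector, return the index of the *other* vector
    with the highest cosine score (assumption: one neighbor each)."""
    targets = []
    for i, u in enumerate(vectors):
        candidates = (j for j in range(len(vectors)) if j != i)
        targets.append(max(candidates, key=lambda j: cosine(u, vectors[j])))
    return targets

# Toy "face image vectors": two loose clusters.
faces = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
targets = nearest_neighbor_targets(faces)  # → [1, 0, 3, 2]
```

An autoencoder would then be trained on pairs `(faces[i], faces[targets[i]])` rather than on `(faces[i], faces[i])`.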

face image vector, image vector, vector, (14 more...)

arXiv.org Artificial Intelligence

2312.14395

Country:

North America > United States > Virginia > Richmond (0.04)
North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision > Face Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Language with Vision: a Study on Grounded Word and Sentence Embeddings

Shahmohammadi, Hassan, Heitmeier, Maria, Shafaei-Bajestan, Elnaz, Lensch, Hendrik P. A., Baayen, Harald

arXiv.org Artificial Intelligence, Oct-31-2023

Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations by incorporating perceptual knowledge from vision into text-based representations. Despite many attempts at language grounding, achieving an optimal equilibrium between textual representations of the language and our embodied experiences remains an open field. Some common concerns are the following. Is visual grounding advantageous for abstract words, or is its effectiveness restricted to concrete words? What is the optimal way of bridging the gap between text and vision? To what extent is perceptual knowledge from images advantageous for acquiring high-quality embeddings? Leveraging the current advances in machine learning and natural language processing, the present study addresses these questions by proposing a simple yet very effective computational grounding model for pre-trained word embeddings. Our model effectively balances the interplay between language and vision by aligning textual embeddings with visual information while simultaneously preserving the distributional statistics that characterize word usage in text corpora. By applying a learned alignment, we are able to indirectly ground unseen words including abstract words. A series of evaluations on a range of behavioural datasets shows that visual grounding is beneficial not only for concrete words but also for abstract words, lending support to the indirect theory of abstract concepts. Moreover, our approach offers advantages for contextualized embeddings, such as those generated by BERT, but only when trained on corpora of modest, cognitively plausible sizes. Code and grounded embeddings for English are available at https://github.com/Hazel1994/Visually_Grounded_Word_Embeddings_2.

grounded word, representation, springer nature 2021, (14 more...)

arXiv.org Artificial Intelligence

2206.08823

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
Africa > Kenya > Mandera County > Mandera (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
(23 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Education (1.00)
Government (0.92)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision

Gu, Sophia, Clark, Christopher, Kembhavi, Aniruddha

arXiv.org Artificial Intelligence, Aug-18-2023

Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether it is possible to learn those skills from text data and then transfer them to vision tasks without ever training on visual training data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between embedding spaces for different modalities in contrastive models, and we analyze how these differences affect our approach and study strategies to mitigate this concern. We produce models using only text training data on four representative tasks: image captioning, visual entailment, visual question answering and visual news captioning, and evaluate them on standard benchmarks using images. We find these models perform close to models trained on images, while surpassing prior work for captioning and visual entailment in this text-only setting by over 9 points, and outperforming all prior work on visual news by over 30 points. We also showcase a variety of stylistic image captioning models that are trained using no image data and no human-curated language data, but instead using readily-available text data from books, the web, or language models.

caption, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2211.09778

Country: Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)

N way K Shot: Siamese Network with Contrastive Loss for pokemon Classification

#artificialintelligence, Sep-18-2022, 23:55:15 GMT

When we have a tiny dataset, few-shot learning can be applied. A Siamese network with contrastive loss is one of the few-shot learning algorithms. Let's first examine the differences between neural networks and Siamese networks before briefly moving on to Siamese. This is merely an intuitive understanding of the Siamese network; the preprocessing and training will differ slightly from those of neural networks, and I'll go into more detail about how it functions in a moment. Deep learning is always data-hungry; the more data, the better the performance. For neural network training, we need at least a few thousand data points; otherwise, the network will overfit, and even with regularisation and fine-tuning, low precision is expected.
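For readers who want a concrete picture of the contrastive loss mentioned above, here is a minimal sketch of one common formulation: pairs from the same class are penalised by their squared distance, and pairs from different classes are penalised only when they fall inside a margin. The function names, the default `margin=1.0`, and the omission of any 1/2 scaling factor are assumptions; the article's own implementation may differ.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    same_class=True  -> pull the pair together (loss = d^2)
    same_class=False -> push apart until distance >= margin
    """
    d = euclidean(emb_a, emb_b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

# A distant pair is costly if it should match, free if it should not.
loss_same = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=True)   # 25.0
loss_diff = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=False)  # 0.0
```

In a Siamese setup both embeddings come from the same network applied to two inputs, so a single set of weights receives the gradient from each pair.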

dataset, image vector, vector, (10 more...)

#artificialintelligence

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.05)

Industry: Leisure & Entertainment > Games > Computer Games (0.41)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Face recognition using PCA integrated with Delaunay triangulation

Adeshara, Kavan, Elangovan, Vinayak

arXiv.org Artificial Intelligence, Nov-25-2020

Face recognition is most commonly used for biometric user authentication, which identifies a user based on his or her facial features. The system is in high demand, as it is used by many businesses and employed in many devices such as smartphones and surveillance cameras. However, one frequent problem that is still observed in this user-verification method is its accuracy rate. Numerous approaches and algorithms have been tried to address this flaw. This research develops one such algorithm that utilizes a combination of two different approaches. Using concepts from linear algebra and computational geometry, the research examines the integration of Principal Component Analysis with Delaunay Triangulation; the method triangulates a set of face landmark points and obtains eigenfaces of the provided images. It compares the algorithm with traditional PCA and discusses the inclusion of different face landmark points to deliver an effective recognition rate.
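To make the PCA half of this idea more concrete, the sketch below estimates the first principal component of a set of vectors by power iteration on their covariance matrix; applied to flattened face images, that leading direction would correspond to the first eigenface. It is a toy illustration under stated assumptions (hypothetical function names, a fixed iteration count, an arbitrary start vector, non-degenerate data) and does not cover the Delaunay triangulation step or the authors' actual pipeline.

```python
import math

def top_principal_component(samples, iterations=200):
    """Return (mean, component): the sample mean and a unit vector
    approximating the first principal component via power iteration.

    Assumes at least two samples that are not all identical.
    """
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(dim)]
    centered = [[s[k] - mean[k] for k in range(dim)] for s in samples]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    vec = [1.0] + [0.5] * (dim - 1)  # arbitrary non-zero start vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return mean, vec

def project(sample, mean, component):
    """Coordinate of a sample along the principal component."""
    return sum((x - m) * c for x, m, c in zip(sample, mean, component))

# Toy data lying on the line y = x: the component points along (1, 1).
mean, component = top_principal_component([[0, 0], [1, 1], [2, 2], [3, 3]])
```

Recognition with eigenfaces typically compares such projections between a probe image and enrolled images, using several components rather than one.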

algorithm, delaunay triangulation, recognition, (12 more...)

arXiv.org Artificial Intelligence

2011.12786

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)
North America > United States > Washington > King County > Seattle (0.04)
Europe > Czechia > Prague (0.04)
Asia > China > Sichuan Province > Chengdu (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision > Face Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
